#### Library Imports

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
```
#### Template

```python
spark = (
    SparkSession.builder
    .master("local")
    .appName("Section 2.12 - Performing Joins (clean one)")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)

sc = spark.sparkContext
```
```python
pets = spark.createDataFrame(
    [
        (1, 1, 'Bear'),
        (2, 1, 'Chewie'),
        (3, 2, 'Roger'),
    ], ['id', 'breed_id', 'nickname']
)

pets.toPandas()
```
| | id | breed_id | nickname |
|---|---|---|---|
| 0 | 1 | 1 | Bear |
| 1 | 2 | 1 | Chewie |
| 2 | 3 | 2 | Roger |
```python
breeds = spark.createDataFrame(
    [
        (1, 'Pitbull', 10),
        (2, 'Corgie', 20),
    ], ['id', 'name', 'average_height']
)

breeds.toPandas()
```
| | id | name | average_height |
|---|---|---|---|
| 0 | 1 | Pitbull | 10 |
| 1 | 2 | Corgie | 20 |
#### Performing Joins

There are typically two types of joins in SQL:

* **Inner Join**: 2 tables are joined on the basis of the common columns mentioned in the `ON` clause, ie. `left.join(right, left[lkey] == right[rkey])`.
* **Natural Join**: 2 tables are joined on the basis of all common columns, ie. `left.join(right, 'key')`.

source: https://stackoverflow.com/a/8696402
Question: Which is better? Is it just a style choice?
#### Option 1: Inner Join (w/ Different Keys)

```python
join_condition = pets['breed_id'] == breeds['id']
df = pets.join(breeds, join_condition)

df.toPandas()
```
| | id | breed_id | nickname | id | name | average_height |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Bear | 1 | Pitbull | 10 |
| 1 | 2 | 1 | Chewie | 1 | Pitbull | 10 |
| 2 | 3 | 2 | Roger | 2 | Corgie | 20 |
**What Happened:**

* We have 2 columns named `id`, but they refer to different things.
* We can't uniquely reference these 2 columns (easily; it is still possible).
* Pretty long `join expression`.

This is not ideal. Let's try renaming a key before the join.
#### Option 2: Inner Join (w/ Same Keys)

```python
breeds = breeds.withColumnRenamed('id', 'breed_id')

join_condition = pets['breed_id'] == breeds['breed_id']
df = pets.join(breeds, join_condition)

df.toPandas()
```
| | id | breed_id | nickname | breed_id | name | average_height |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Bear | 1 | Pitbull | 10 |
| 1 | 2 | 1 | Chewie | 1 | Pitbull | 10 |
| 2 | 3 | 2 | Roger | 2 | Corgie | 20 |
**What Happened:**

* We have 2 columns named `breed_id` which mean the same thing!
* Duplicate columns appear in the result.
* Still a pretty long `join expression`.

This is again not ideal.
#### Option 3: Natural Join

```python
df = pets.join(breeds, 'breed_id')

df.toPandas()
```
| | breed_id | id | nickname | name | average_height |
|---|---|---|---|---|---|
| 0 | 1 | 1 | Bear | Pitbull | 10 |
| 1 | 1 | 2 | Chewie | Pitbull | 10 |
| 2 | 2 | 3 | Roger | Corgie | 20 |
**What Happened:**

* No duplicated column!
* No extra column!
* A single string is required for the `join expression` (or a list of columns/keys, if joining on multiple keys).
#### Summary

Performing a natural join was the most elegant solution, in terms of both the join expression and the resulting df.
**NOTE:** These rules also apply to the other join types (ie. left and right joins).
Some might argue that you will need both join keys in the result for further transformations, such as filtering on only the left or right key, but I would recommend doing this before the join, as it is more performant.